Detailed Summary: Asynchronous Methods for Deep Reinforcement Learning (A3C)
Citation: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu. Google DeepMind / Montreal Institute for Learning Algorithms (MILA). ICML 2016. arXiv:1602.01783v2.
Abstract (Paraphrased)
This paper presents a conceptually simple and lightweight framework for deep RL using asynchronous gradient descent for optimization of deep neural network controllers. The key idea is running multiple actor-learners in parallel on independent environment instances, using their diverse experience to decorrelate training data without experience replay. Four asynchronous algorithms are proposed: one-step Q-learning, one-step Sarsa, n-step Q-learning, and Asynchronous Advantage Actor-Critic (A3C). A3C surpasses state-of-the-art on Atari using half the training time of prior work, running on a single multi-core CPU rather than a GPU.
Motivation
The Non-Stationarity Problem in Online Deep RL
When a single agent interacts sequentially with an environment, consecutive observations (s_t, a_t, r_t, s_{t+1}) are highly correlated. Training a deep neural network directly on such correlated sequences produces unstable, divergent learning because the data distribution shifts as the policy changes.
The dominant solution before this paper was experience replay (used in DQN): store transitions in a large replay memory, then sample random mini-batches to break temporal correlations. This works but has three drawbacks:
- Requires storing millions of transitions (high memory).
- Requires more computation per real environment step.
- Can only be applied to off-policy algorithms (Q-learning, DQN) because sampled transitions may come from a different policy than the current one.
This restriction prevented using on-policy algorithms (Sarsa, actor-critic, REINFORCE) with deep networks.
The Insight: Parallelism as a Decorrelator
If multiple agents independently interact with separate environment instances simultaneously, at any given moment the agents are experiencing different states. The resulting training data is naturally diverse. This eliminates the need for a replay buffer entirely and allows both on-policy and off-policy deep RL to be trained stably.
An additional benefit: wall-clock training time scales (near-linearly for some methods) with the number of parallel workers, since more environment interactions occur per unit time.
RL Background
Standard Formulation
- At each timestep t, agent observes state s_t, selects action a_t ~ pi(a_t|s_t), receives reward r_t, transitions to s_{t+1}.
- Return from time t: R_t = sum_{k=0}^{inf} gamma^k * r_{t+k}, discount factor gamma in (0,1].
- Goal: maximize E[R_t] from each state s_t.
Value Functions
- Action-value: Q^pi(s,a) = E[R_t | s_t=s, a_t=a]
- Optimal: Q*(s,a) = max_pi Q^pi(s,a)
- State-value: V^pi(s) = E[R_t | s_t=s]
One-Step Q-Learning Loss
L_i(theta_i) = E[(r + gamma * max_{a'} Q(s',a';theta_{i-1}) - Q(s,a;theta_i))^2]
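A minimal PyTorch sketch of this loss; q_net, target_net, and the tensor shapes are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_step_q_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """L(theta) = (r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                       # target network theta^- held fixed
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)  # y = r when terminal
    return F.mse_loss(q_sa, y)

# Toy usage: 4-dim states, 3 actions, batch of 8 transitions.
q_net, target_net = nn.Linear(4, 3), nn.Linear(4, 3)
s, s_next = torch.randn(8, 4), torch.randn(8, 4)
a, r, done = torch.randint(0, 3, (8,)), torch.randn(8), torch.zeros(8)
loss = one_step_q_loss(q_net, target_net, s, a, r, s_next, done)
```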
Policy Gradient (REINFORCE)
REINFORCE updates the policy parameters in the direction nabla_theta log pi(a_t|s_t;theta) * R_t, which is an unbiased estimate of nabla_theta E[R_t].
Variance-reduced version using baseline b_t(s_t) approx V^pi(s_t):
gradient = nabla_theta log pi(a_t|s_t;theta) * (R_t - b_t(s_t))
The quantity R_t - b_t(s_t) can be seen as an estimate of the advantage A(a_t, s_t) = Q(a_t, s_t) - V(s_t) of action a_t in state s_t: how much better the action is than the policy's average in that state.
This leads to the actor-critic architecture: actor = policy pi, critic = value function V^pi (the baseline).
Asynchronous RL Framework (Section 4)
Two Core Design Choices
Choice 1: Multi-thread on a single machine. Unlike Gorila (which uses 100 separate actor-learner processes and 30 parameter servers across many machines), A3C runs all workers as threads on a single multi-core CPU. This eliminates inter-machine communication cost and allows lock-free Hogwild!-style parameter sharing via shared memory.
Choice 2: Diverse exploration policies per thread. Each thread samples its exploration parameter epsilon from a distribution, creating naturally diverse data across workers. This further reduces temporal correlation.
Four Algorithms
Asynchronous One-Step Q-Learning (Algorithm 1)
Each thread:
- Takes action with an epsilon-greedy policy based on Q(s,a;theta) (consistent with the gradient below, which is taken with respect to the shared theta).
- Computes gradient: d(y - Q(s,a;theta))^2 / d theta, where y = r (terminal) or r + gamma * max_{a'} Q(s',a';theta^-) (non-terminal).
- Accumulates gradient into d_theta.
- Asynchronously updates global theta using d_theta every I_AsyncUpdate steps.
- Periodically syncs target network theta^- <- theta every I_target steps.
Asynchronous One-Step Sarsa
Same as above but uses on-policy target: y = r + gamma * Q(s',a';theta^-) where a' is the next action taken by the policy (not the max).
Asynchronous n-Step Q-Learning (Algorithm S2)
Operates in the forward view: accumulate up to t_max steps of experience, then compute multi-step return:
R = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * max_a Q(s_{t+n},a;theta^-)
Compute gradient for each state-action pair in the window, accumulate, then apply in one async update.
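A plain-Python sketch of the backward recursion that produces these multi-step targets for every step in the window; the variable names are illustrative:

```python
def n_step_q_targets(rewards, bootstrap_q_max, terminal, gamma=0.99):
    """Multi-step targets for (s_t, a_t), ..., (s_{t+n-1}, a_{t+n-1}).

    rewards:         [r_t, ..., r_{t+n-1}] collected over the window (n <= t_max)
    bootstrap_q_max: max_a Q(s_{t+n}, a; theta^-); ignored if the window ended at a terminal state
    """
    R = 0.0 if terminal else bootstrap_q_max
    targets = []
    for r in reversed(rewards):      # latest reward first: R = r_i + gamma * R
        R = r + gamma * R
        targets.append(R)
    return list(reversed(targets))   # targets[i] pairs with (s_{t+i}, a_{t+i})

# e.g. n_step_q_targets([1.0, 0.0, 1.0], bootstrap_q_max=2.0, terminal=False)
# -> approximately [3.9207, 2.9502, 2.98] with gamma = 0.99
```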
Asynchronous Advantage Actor-Critic — A3C (Algorithm S3)
The headline algorithm. Maintains two networks sharing most parameters:
- Actor: policy pi(a_t|s_t; theta)
- Critic: value function V(s_t; theta_v)
Per-thread pseudocode:
repeat:
    Reset gradients: d_theta = 0, d_theta_v = 0
    Sync local params: theta' = theta, theta_v' = theta_v
    t_start = t
    Get state s_t
    repeat:
        Select a_t ~ pi(a_t|s_t; theta')
        Receive r_t and new state s_{t+1}
        t = t + 1; T = T + 1
    until s_t is terminal or t - t_start == t_max
    if s_t is terminal: R = 0
    else:               R = V(s_t; theta_v')      // bootstrap from the last state
    for i = t-1, ..., t_start:
        R = r_i + gamma * R
        d_theta   += nabla_{theta'} log pi(a_i|s_i; theta') * (R - V(s_i; theta_v'))
        d_theta_v += d(R - V(s_i; theta_v'))^2 / d theta_v'
    Async update of global theta using d_theta, and of global theta_v using d_theta_v
until T > T_max
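A condensed PyTorch sketch of one iteration of this per-thread loop. The ActorCritic interface (returning action logits and a value), the Gym-style env, the grad hand-off idiom, and the value_coef weight are assumptions for illustration, not the paper's implementation; the optimizer is assumed to be built over global_model.parameters() (shared RMSProp in the paper).

```python
import torch

def a3c_rollout_update(global_model, local_model, optimizer, env, state,
                       t_max=5, gamma=0.99, beta=0.01, value_coef=0.5):
    """One t_max-step rollout followed by one asynchronous update of the global parameters."""
    local_model.load_state_dict(global_model.state_dict())     # theta' <- theta, theta_v' <- theta_v

    log_probs, values, rewards, entropies = [], [], [], []
    done = False
    for _ in range(t_max):                                      # collect up to t_max transitions
        logits, value = local_model(state)                      # state assumed to be a ready tensor
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                                  # a_t ~ pi(.|s_t; theta')
        state, reward, done, _ = env.step(action.item())        # Gym-style (obs, reward, done, info)
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(reward)
        entropies.append(dist.entropy())
        if done:
            break                                               # caller resets env before the next call

    # Bootstrap R = V(s_t; theta_v') from the last state unless the episode terminated.
    R = torch.zeros(1) if done else local_model(state)[1].detach()

    policy_loss, value_loss = 0.0, 0.0
    for i in reversed(range(len(rewards))):                     # backward in time: R = r_i + gamma * R
        R = rewards[i] + gamma * R
        advantage = R - values[i]
        value_loss = value_loss + advantage.pow(2)              # (R - V(s_i; theta_v'))^2
        policy_loss = policy_loss - log_probs[i] * advantage.detach() - beta * entropies[i]
    loss = (policy_loss + value_coef * value_loss).sum()

    local_model.zero_grad()
    loss.backward()                                             # gradients accumulate in the local copy
    for lp, gp in zip(local_model.parameters(), global_model.parameters()):
        gp.grad = lp.grad                                       # hand d_theta, d_theta_v to global params
    optimizer.step()                                            # lock-free update of global theta, theta_v
    return state, done
```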
Entropy regularization: The full policy gradient objective includes an entropy bonus:
nabla_{theta'} log pi(a_t|s_t;theta') * (R_t - V(s_t;theta_v')) + beta * nabla_{theta'} H(pi(s_t;theta'))
where H is the entropy of the policy distribution and beta=0.01. This discourages premature convergence to deterministic policies.
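For a softmax policy the entropy is simply H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s); a small worked sketch (the logits values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, 0.1, 0.1])        # unnormalized action preferences for one state
probs = F.softmax(logits, dim=-1)
entropy = -(probs * probs.log()).sum()              # H(pi(s)); maximal (log 4 ~ 1.386) for a uniform policy
bonus = 0.01 * entropy                               # beta * H(pi(s)) added to the policy objective
```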
System Architecture
Shared Memory (Global Parameters)
+-------------------------------------------+
| theta     (policy network weights)        |
| theta_v   (value network weights)         |
| g         (shared RMSProp moving average) |
| T         (global step counter)           |
+-------------------------------------------+
        ^               ^                 ^
        |               |                 |
    Thread-1        Thread-2    ...   Thread-N
  +-----------+   +-----------+     +-----------+
  | Env copy  |   | Env copy  |     | Env copy  |
  | theta'    |   | theta'    |     | theta'    |
  | theta_v'  |   | theta_v'  |     | theta_v'  |
  +-----------+   +-----------+     +-----------+
Each thread: collects t_max steps, computes gradients, asynchronously updates global theta and theta_v.
Each thread maintains a thread-local copy of parameters (theta', theta_v'). Parameters are synchronized from the global at the start of each t_max-step rollout. Gradients from the rollout are applied asynchronously to the global — no locking, no coordination.
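A sketch of the corresponding Hogwild!-style setup in PyTorch, where the global network lives in shared memory and each worker process updates it without locks. train_worker is a stub standing in for the per-thread loop sketched earlier; the network and learning rate are placeholders.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, global_model, optimizer):
    # Each worker would create its own environment copy and local model, then repeatedly
    # run the t_max-step rollout/update loop sketched above. Left as a stub here.
    pass

def launch(num_workers=16):
    global_model = nn.Linear(4, 2)                  # placeholder for the actor-critic network
    global_model.share_memory()                     # parameters placed in shared memory (Hogwild! style)
    # NOTE: a stock RMSprop keeps its moving-average statistics per process; the paper's
    # Shared RMSProp also places those statistics (g) in shared memory.
    optimizer = torch.optim.RMSprop(global_model.parameters(), lr=1e-3, alpha=0.99)

    workers = [mp.Process(target=train_worker, args=(rank, global_model, optimizer))
               for rank in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    launch()
```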
Neural Network Architecture
For Atari experiments:
- Conv layer: 16 filters, 8x8, stride 4, ReLU
- Conv layer: 32 filters, 4x4, stride 2, ReLU
- Fully connected: 256 hidden units, ReLU
- Actor head: softmax over actions (one output per action)
- Critic head: single linear output (V(s))
- All non-output layers shared between actor and critic
- LSTM variant: adds 256 LSTM cells after the last fully connected layer
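A PyTorch sketch of the feedforward Atari variant; layer sizes follow the list above, while the class name and the flattened-feature size (9x9x32 for 84x84 inputs) are my own working-out:

```python
import torch
import torch.nn as nn

class AtariActorCritic(nn.Module):
    def __init__(self, num_actions, in_channels=4):                        # 4 stacked 84x84 grayscale frames
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)   # 84x84 -> 20x20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)            # 20x20 -> 9x9
        self.fc = nn.Linear(32 * 9 * 9, 256)
        self.policy_head = nn.Linear(256, num_actions)                     # actor: softmax over these logits
        self.value_head = nn.Linear(256, 1)                                # critic: scalar V(s)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        return self.policy_head(x), self.value_head(x)                     # shared trunk, two heads
```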
For MuJoCo continuous control:
- Encoder: two convolutional layers -> 128 LSTM cells (for pixel input); or single FC layer with 200 ReLU units (for physical state input)
- Actor head: mean vector mu (linear) + variance sigma^2 (SoftPlus)
- Policy samples from N(mu, sigma^2 * I)
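For the continuous-control head, a minimal sketch of the linear-mean / SoftPlus-variance parameterization; class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class GaussianPolicyHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, action_dim)                                      # mean vector
        self.sigma_sq = nn.Sequential(nn.Linear(hidden_dim, action_dim), nn.Softplus())  # variance > 0

    def forward(self, h):
        mu, sigma_sq = self.mu(h), self.sigma_sq(h)
        dist = torch.distributions.Normal(mu, sigma_sq.sqrt())      # diagonal N(mu, sigma^2 * I)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), dist.entropy().sum(-1)
```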
Optimizer: Shared RMSProp
Standard non-centered RMSProp update:
g = alpha*g + (1-alpha)*delta_theta^2
theta = theta - eta * delta_theta / sqrt(g + epsilon)
Three variants compared:
- Momentum SGD: per-thread momentum vector.
- RMSProp: per-thread moving average g.
- Shared RMSProp: g shared across all threads, updated asynchronously without locking.
Shared RMSProp is consistently more robust to learning rate and initialization variance than the other two. All experiments use Shared RMSProp with alpha=0.99.
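A sketch of the non-centered update above applied in place to a parameter vector; the statistics tensor g is the one shared across threads, and the names and learning rate are illustrative:

```python
import torch

def shared_rmsprop_step(theta, grad, g, lr=1e-3, alpha=0.99, eps=1e-8):
    """g = alpha*g + (1 - alpha)*grad^2 ; theta = theta - lr * grad / sqrt(g + eps)."""
    g.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)    # update the shared moving average in place
    theta.addcdiv_(grad, (g + eps).sqrt(), value=-lr)      # apply the parameter step in place

# Toy usage: the same g tensor would be reused (without locking) by every thread.
theta, g = torch.zeros(10), torch.zeros(10)
shared_rmsprop_step(theta, torch.randn(10), g)
```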
Evaluation
Setup
- Hardware: Single machine with 16 CPU cores; no GPU.
- Comparison baseline: DQN trained on a single Nvidia K40 GPU (8 days).
- Atari: 57 games in the Arcade Learning Environment (ALE). Human starts evaluation metric.
- TORCS: 3D racing simulator, 4 speed/bot configurations.
- MuJoCo: Rigid body physics, multiple locomotion/manipulation tasks.
- Labyrinth: Randomly generated 3D mazes, visual input only.
- Hyperparameter search: 50 random seeds per game; learning rate sampled from LogUniform(1e-4, 1e-2); all other hyperparameters fixed.
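Sampling from LogUniform(1e-4, 1e-2) just means drawing the exponent uniformly and exponentiating, e.g.:

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 learning-rate candidates from LogUniform(1e-4, 1e-2): uniform in log-space, then exp.
learning_rates = np.exp(rng.uniform(np.log(1e-4), np.log(1e-2), size=50))
```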
Key Results
Atari 57 Games (Table 1, Human-Start Metric)
| Method | Training | Mean (%) | Median (%) |
|---|---|---|---|
| DQN | 8 days GPU | 121.9 | 47.5 |
| Gorila | 4 days, 100 machines | 215.2 | 71.3 |
| Double DQN | 8 days GPU | 332.9 | 110.9 |
| Dueling Double DQN | 8 days GPU | 343.8 | 117.1 |
| Prioritized DQN | 8 days GPU | 463.6 | 127.6 |
| A3C FF (1 day) | 1 day CPU | 344.1 | 68.2 |
| A3C FF | 4 days CPU | 496.8 | 116.6 |
| A3C LSTM | 4 days CPU | 623.0 | 112.6 |
Running on CPUs only, A3C FF surpasses Double DQN's median score after 4 days of training (half the training time), and A3C LSTM achieves the highest mean score of all listed methods.
Scalability (Table 2): Training Speedup vs. Number of Threads
| Method | 1 thread | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1-step Q | 1.0 | 3.0 | 6.3 | 13.3 | 24.1 |
| 1-step Sarsa | 1.0 | 2.8 | 5.9 | 13.1 | 22.1 |
| n-step Q | 1.0 | 2.7 | 5.9 | 10.7 | 17.2 |
| A3C | 1.0 | 2.1 | 3.7 | 6.9 | 12.5 |
The one-step methods achieve superlinear speedups (e.g., 24.1x with 16 threads), suggesting that parallelism improves learning efficiency and not just data throughput. The n-step methods and A3C scale sublinearly (plausibly because multi-step returns yield higher-variance gradients), with A3C scaling worst, yet A3C delivers the best final performance.
Robustness (Fig. 2)
Scatter plots of final scores over 50 learning rates and random initializations show large "good performance" regions with no zero-score collapses — the algorithm does not diverge.
TORCS
A3C reaches 75–90% of human tester score across four configurations (slow/fast car, with/without bots) in ~12 hours of training.
MuJoCo
A3C solves all tested tasks within 24 hours using CPU; typical solution time is a few hours.
Labyrinth (3D Maze)
A3C LSTM reaches average score ~50 on randomly generated mazes using only 84x84 pixel visual input, demonstrating general exploration strategy learning.
RL Formulation Table
| Component | A3C Specification |
|---|---|
| Agent | Actor-critic neural network (shared CNN/LSTM backbone) |
| Environment | Atari ALE / TORCS / MuJoCo / Labyrinth; one copy per thread |
| State (s_t) | Last 4 game frames (84x84 grayscale) for Atari; RGB frame / physical state for others |
| Action (a_t) | Discrete: sampled from the softmax policy output; Continuous: sampled from N(mu, sigma^2 * I) |
| Reward (r_t) | Game score delta (clipped to [-1,+1] for Atari); velocity reward for TORCS |
| Return (R) | n-step discounted return R = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n}; theta_v') |
| Advantage | A(a_t,s_t) = R - V(s_t; theta_v') |
| Policy update | nabla_{theta'} log pi(a_t\|s_t;theta') * (R - V(s_t;theta_v')) + beta * nabla_{theta'} H(pi(s_t;theta')) |
| Value update | Minimize (R - V(s_t;theta_v'))^2 |
| Algorithm | On-policy policy gradient with advantage baseline (actor-critic) |
| Optimizer | Shared RMSProp, alpha=0.99, initial lr ~ LogUniform(1e-4, 1e-2) |
| Discount gamma | 0.99 |
| t_max | 5 (for Atari); episode length for MuJoCo |
| Entropy beta | 0.01 |
| Workers | 16 CPU threads |
Limitations
Stale gradients: Threads apply gradients computed from parameters that may be several updates old. Empirically stable, but no convergence guarantees with non-linear function approximation.
Data efficiency: Without replay, each transition is used exactly once. Off-policy methods with replay can reuse transitions many times, potentially learning faster per environment interaction.
Thread scaling for A3C: Speedup is sublinear (12.5x at 16 threads vs. 24.1x for 1-step Q), plausibly because multi-step advantage estimates have higher gradient variance.
Hyperparameter sensitivity (relative): Although more robust than DQN, A3C still requires tuning the learning rate, t_max, and the entropy coefficient beta (the epsilon exploration schedule applies only to the value-based asynchronous variants).
No prioritization: Unlike Prioritized DQN, A3C does not focus on high-surprise transitions; in sparse reward environments this may slow learning.
Related Work
- DQN (Mnih et al., 2015): Experience replay + target network; foundational GPU-based deep RL.
- Gorila (Nair et al., 2015): Distributed DQN on 100 machines + 30 parameter servers; high infrastructure cost.
- Double DQN (Van Hasselt et al., 2015): Reduces Q-value overestimation; orthogonal to A3C.
- Prioritized Experience Replay (Schaul et al., 2015): Can be incorporated into A3C via replay buffer.
- TRPO (Schulman et al., 2015a): Trust region policy optimization; stronger convergence guarantees but more complex.
- Hogwild! (Recht et al., 2011): Lock-free asynchronous SGD; theoretical basis for A3C's update mechanism.
- Tsitsiklis (1994): Convergence of asynchronous Q-learning; theoretical underpinning.
Relevance to DynamICCL
High direct relevance. A3C is a foundational algorithm that directly informs DynamICCL's Config Agent design:
Direct algorithm alternative to DQN: DynamICCL's Agent-2 currently uses DQN. A3C (particularly the LSTM variant) is a strong alternative for the following reasons:
- DynamICCL's NCCL configuration environment is non-stationary (congestion fluctuates); A3C's on-policy, replay-free design adapts more quickly to shifting distributions.
- A3C with LSTM naturally captures temporal patterns in congestion signals, complementing Agent-1's CUSUM detection.
- A3C's actor-critic structure explicitly separates "how good is this state" (V(s)) from "how good is this action" (A(s,a)), which is valuable for DynamICCL because collective completion time is a noisy, state-dependent signal.
Multi-worker parallelism for NCCL config exploration: DynamICCL operates in a cluster environment where multiple nodes run collective operations simultaneously. Each node could host a separate A3C worker exploring NCCL configurations, sharing a global policy — this is precisely the A3C framework applied to the DynamICCL training problem.
Entropy regularization for exploration: NCCL's configuration space has 7 algorithms * 3 protocols * 8 channel counts = 168 discrete combinations (more once additional tunables are included). A3C's entropy bonus encourages exploration of this discrete space, preventing premature convergence to a locally-optimal but globally suboptimal configuration.
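Purely as an illustration of the scale of this action space (the option lists below are hypothetical stand-ins, not DynamICCL's or NCCL's actual values):

```python
import itertools
import math

algorithms = [f"algo_{i}" for i in range(7)]        # 7 hypothetical algorithm choices
protocols = [f"proto_{i}" for i in range(3)]        # 3 hypothetical protocol choices
channel_counts = [1, 2, 4, 8, 12, 16, 24, 32]       # 8 hypothetical channel-count settings

actions = list(itertools.product(algorithms, protocols, channel_counts))
print(len(actions))            # 7 * 3 * 8 = 168 discrete configurations
print(math.log(len(actions)))  # ~5.12 nats: the entropy of a uniform policy over this space
```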
Advantage function for variance reduction: Collective completion times have high variance due to network jitter and background traffic. The advantage function A(s,a) = R - V(s) subtracts a state-dependent baseline, reducing gradient variance — critical for DynamICCL's noisy reward signal.
CPU-only training on Chameleon Cloud: DynamICCL trains on bare-metal HPC nodes. A3C's single-machine multi-thread CPU-based training eliminates the need for dedicated GPU training infrastructure during the RL training phase, leaving GPU resources free for the actual collective workloads being optimized.
LSTM for temporal state: Agent-1 (Trigger Agent) uses LSTM+CUSUM. An A3C-LSTM Config Agent could jointly maintain temporal state about recent collective performance, correlating past congestion signals with configuration outcomes — a more unified architecture than two separate agents.